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Abstract 

Background: Recent innovations in sequencing technologies have provided researchers with the ability to rapidly 
characterize the microbial content of an environmental or clinical sample with unprecedented resolution. These 
approaches are producing a wealth of information that is providing novel insights into the microbial ecology of 
the environment and human health. However, these sequencing-based approaches produce large and complex 
datasets that require efficient and sensitive computational analysis workflows. Many recent tools for analyzing 
metagenomic-sequencing data have emerged, however, these approaches often suffer from issues of specificity, 
efficiency, and typically do not include a complete metagenomic analysis framework. 

Results: We present PathoScope 2.0, a complete bioinformatics framework for rapidly and accurately quantifying 
the proportions of reads from individual microbial strains present in metagenomic sequencing data from 
environmental or clinical samples. The pipeline performs all necessary computational analysis steps; including 
reference genome library extraction and indexing, read quality control and alignment, strain identification, and 
summarization and annotation of results. We rigorously evaluated PathoScope 2.0 using simulated data and data 
from the 201 1 outbreak of Shiga-toxigenic Escherichia coli O104:H4. 

Conclusions: The results show that PathoScope 2.0 is a complete, highly sensitive, and efficient approach for 
metagenomic analysis that outperforms alternative approaches in scope, speed, and accuracy. The PathoScope 
2.0 pipeline software is freely available for download at: http://sourceforge.net/projects/pathoscope/. 



Background 

The rapid and accurate characterization of microbial 
flora in clinical or environmental samples is critical for 
many applications including understanding the role of 
the microbiome in human health, personalized responses 
to acute or chronic infections, and early detection in 
disease outbreaks. With the steadily increasing number 
of microbial genomes available in public data repositories, 
metagenomic characterization using high-throughput se- 
quencing techniques can be used to catalogue microbes 
co-habituating in human systems [1] and to rapidly identify 
pathogens responsible for infectious disease outbreaks 
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[2-4]. This proliferation of metagenomic sequence data has 
resulted in the development of novel analytical approaches. 
Feature selection approaches exploit features from genomic 
patterns or composition [5,6], preserved sequence segments 
[7-9] or predetermined clade markers [10,11]. Assembly- 
based methods [12-15] have recently gained in popularity 
due to their increased sensitivity for strain identification. 
However, these approaches can suffer from issues of specifi- 
city, efficiency, and typically do not include a complete 
metagenomic analysis framework with reference library 
generation, read quality control, and reporting. 

Here we present a complete framework for rapid and 
accurate metagenomic profiling at the subspecies level 
that overcomes the limitations of other approaches. In 
our previous work, we developed a statistical algorithm 
(formerly PathoScope 1.0 and henceforth denoted as the 
PathoID module of the PathoScope 2.0 framework) to 
reassign ambiguously aligned sequencing reads and 



o 



Bion/led Central 



© 201 4 Hong et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http;//creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain 
Dedication waiver (httpy/creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, 
unless otherwise stated. 



Hong ef al. Microbiome 2014, 2:33 
http://www.microbiomejoumal.eom/content/2/1/33 



Page 2 of 15 



accurately estimate read proportions from each genome 
in the sample [16]. In PathoScope 2.0, we have intro- 
duced performance improvements to the PathoID algo- 
rithm such as better utilization of alignment scores, the 
inclusion of reference length and alignment quality into 
the reassignment model, and the addition of flexible 
priors for the read proportions and ambiguity penalties. 
In addition, PathoScope 2.0 extends the PathoID module 
into a complete workflow for analyzing data from clin- 
ical or environmental sequencing samples, with novel 
modules that: (1) automatically extract custom reference 
genome libraries for microbial or host genomes (PathoLib); 
(2) construct reference indices, align reads, and filter reads 
that align to the host (PathoMap); (3) conduct complete, 
parallel read quality processing (PathoQC); (4) annotate 
all sequences in the reference library with information 
such as organism name, taxonomic lineage, and gene loci 
(PathoDB); and (5) provide detailed reports on organisms, 
read coverage, genes, and gene products identified in the 
study (PathoReport). The modular PathoScope 2.0 frame- 
work minimizes interaction for users with weaker computa- 
tional backgrounds, while providing experienced users the 
flexibility to conduct analysis steps outside of the pipeline 
and to develop new plug-in modules. 

Results and discussion 

New features in Pathoscope 2.0 

With the introduction of PathoScope 2.0, we improve our 
previous PathoID algorithm for strain identification [16] 
and we extend our software into a complete framework for 
the analysis of metagenomic sequencing data. PathoScope 
2.0 (Figure 1) currentiy consists of four core pipeline mod- 
ules (PathoLib, PathoMap, PathoID, and PathoReport) and 
two optional 'plug-in' modules (PathoDB and PathoQC), 
with capability for seamless interaction with other metage- 
nomic tools and for the development of future plug-in 
modules to increase the usability of the pipeline. Below are 
detailed descriptions of the novel features that are intro- 
duced in PathoScope 2.0: 

PathoLib: Automatic reference library extraction 

The careful selection of a refined reference sequence 
library is crucial for all downstream analyses. The Patho- 
Lib module allows the user to automatically generate 
custom reference genome libraries for specific scenarios 
or datasets. The user supplies a set of NCBI taxonomy 
identification (taxID) numbers for organisms to be in- 
cluded in the library (Figure 2). The user can construct 
both a 'target library' (that is, pathogen genomes of interest) 
and a 'filter library' (for example, host genome or 
benign flora) for later use in the PathoMap module. 
The PathoLib module will extract all sequences in the 
NCBI nucleotide database associated with the taxIDs 
(for example, complete genomes, transcripts, plasmids, 



partially assembled fragments, and so on). In addition, 
if a high-level taxID is given (for example, kingdom, 
family, genus), PathoLib can also optionally extract all 
lower level sequences in the NCBI taxonomy tree. As 
PathoLib extracts the reference library, the NCBI Gen- 
elnfo number is linked to the taxID, and the taxID and 
organism name are appended to the sequence headers 
to further link sequences in downstream analyses. 

PathoMap: Efficient read alignment and filtering 

The PathoMap module aligns reads to the target library 
and removes any sequences that align with an equal or 
greater score to the filter library (Figure 3). Inputs for this 
module are the raw read file (FASTQ) and both the target 
and filter reference libraries (FASTA format). PathoMap 
will: (1) index the reference library (splitting the library into 
multiple indices if necessary); (2) align the reads to the 
target library; and (3) filter any of the target-matching reads 
that also match the filter library. The current version of 
PathoMap includes a Bowtie 2 [17] wrapper (see Figure 3) 
with predetermined optimal alignment parameters for dif- 
ferent read generation technologies (for example, lUumina: 
'-very-sensitive -k 100 -score-min L,-0.6,-0.6'; PacBio: 
'-very-sensitive -k 100 -score-min L, -0.6, -1.5'). The mod- 
ule also allows flexibility for the user to manually input 
Bowtie 2 parameters, or to conduct any part of the align- 
ments outside the PathoMap framework by supplying an 
alignment file in SAM format (Li et al., [18]). Finally, the 
module is constructed in a way that wrappers for additional 
alignment algorithms can easily be substituted for the Bow- 
tie 2 wrapper to accommodate diverse user preferences. 

PathoID: Reassignment of ambiguous reads 

The PathoID module (Figure 4) comprises our previous 
PathoScope 1.0 software, which utilizes a penalized stat- 
istical mixture model that reassigns all ambiguous reads 
to the most likely source genome in the library [16]. 
PathoID takes as input either a SAM or BLAST (format 
8) alignment file, and reassigns the reads to the most 
likely genome of origin. The PathoID module also in- 
cludes three important performance improvements to 
the original PathoScope approach: (1) improving the 
utilization of alignment scores; (2) including both reference 
length and read alignment quality into the reassignment 
model; and (3) the addition of user-defined priors for read 
proportions and ambiguity penalties (See Materials and 
Methods for details). These improvements increase the 
number of reads that are correctiy assigned to the source 
genome, reduce the number of false positive genomes iden- 
tified, and allow PathoID to better handle cases where the 
reference is not fully assembled or the sample contains 
multiple substrains of the same species. PathoID produces 
an updated alignment file of read reassignments and a 
summary report containing genome-level read proportions. 
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Genome Libraries 
(fasta) 



Quality Controlled 
Reads 
(fastq) 



Modules ■ 



1 . PathoDB 

Amysql database containing taxonomy, gene, and 
protein product annotation for all sequences in the 
NCBl nucleotide database. 

2. PathoLib 

Allows user to automatically generate custom reference 
genome libraries for specific scenarios ordatasets. 

3. PathoQC 

Performs several read quality control steps including 
trimming adapters, trimming low quality bases, and 
filtering low complexity reads. 

4. PathoMap 

Aligns reads to target reference genome library and 
removes sequences that align to the filter and host 
libraries. 

5. PatholD 

Reassigns ambiguous reads, identifies microbial strains 
present in the sample, and estimates proportions of 
reads from each genome. 

6. PathoReport 

Generates reports containing read proportions to each 
genome, organism lineage, gene loci, and protein 
products covered by the reads. 



PathoMap 



Filtered Read 
Alignments 



J 



5. PatholD 




Reassigned Read 
Alignments 




Detailed Report 
(xml) 

Figure 1 Workflow of the PathoScope 2.0 framework. PathoScope 2.0 consists of four core and two optional analysis modules for metagenomic 
profiling. Core modules: PathoLib extracts custom genome reference libraries, PathoMap aligns the reads to the reference library and filters host reads, 
PatholD identifies and estimates the proportions from each genome, and PathoReport provides detailed summary reports of the results. PathoDB provides 
additional annotation and PathoQC can be used to preprocess the reads prior to alignment. 



PathoReport: Detailed result reporting and annotation 

The PathoReport module (Figure 5) outputs two files from 
the pipeline. The first output file is a tab-delimited (.tsv) re- 
port that contains the genomes that were identified by the 
previous steps sorted by rank, along with high/low 



confidence read numbers and proportions assigned to each 
genome. The second file, in XML format, contains more 
detailed results, including the reads assigned to each gen- 
ome and contiguous sequences (contigs) constructed from 
overlapping reads. In addition, in concert with the plug-in 
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Step 1 a: Download nucleotide database (nt.fa) 



NCBI 



- nt.fa - 



>gi 1407466711 

AGCTTTTCATTCTGACTGCAACGG... 

> gi I 209395693 

TTTAAAGAGACCGGCGATTCTAGT... 

> gi I 9629367 

AGCGAAAGCAGGTCAATTATATTC... 



Step 1 b: Obtain organism, taxon ID, and gi relationships 



PathoDB 



NCBI 



org 



E. coli0104:H4 


1133853 


40746671 1 


E. coli0157:H7 


444450 


209395693 


E. coli0157:H7 


444450 


209157093 


Respiratory syncytial virus 


12814 


9629367 



Step 2: Append taxon ID and organism to all sequence headers (nt_ti.fa) 
' nt_ti.fa X 



> ti I 1 133853 I org | e_coli_0104:H4 | gi | 40746671 1 
AGCTTTTCATTCTGACTGCAACGG... 

> ti I 444450 I org I e_coli_01 57:H7 | gi | 209395693 
TTTAAAGAGACCGGCGATTCTAGT.. 

> ti I 1 281 4 I org | respiratory_syncytial_virus | gi | 9629367 
AGCGAAAGCAGGTCAATTATATTC... 



Step 3: Download NCBI taxonomy tree (nodes.dmp) 



- nodes.dmp 

ti 1 10239 I org | virus 
I 

ti 1 1 1 245 I org | pneumovirus 



ti I 2 I org | bacteria 

I 

ti I 562 I org I e_coli 



ti I 12814 I org I respiratory_synctial_virus ti 1 1 133853 | org | e_coli_0104:H4 ti | 444450 | org | e_coli_0157:H7 

V y 



step 4: User input 
taxon IDs 



Step 5: Extract target library 



10239 (all viruses) 



113385 (E.coliO104:H4) 



^ Target.fa 

>ti I 1133853 I org I e_coli_0104:H4 I gi 1407466711 
AGCTTTTCATTCTGACTGCAACGG... 

> ti 1 12814 I org I respiratory _syncytiaLvirus | gi | 9629367 
AGCGAAAGCAGGTCAATTATATTC... 
»^ 

Figure 2 PathoLib module workflow. The PathoLIb module will extract a reference library containing all genomes, chromosomes, transcripts, 
and other sequence fragments in the NCBI redundant nucleotide database associated user-defined taxonomic clade (NCBI taxlD). If a higher-level 
taxlD is given, PathoLib will optionally extract all sequences from lower-lever taxonomic designations based on the NCBI taxonomy tree. 



module PathoDB (described below), PathoReport will 
add additional annotation into the report such as 
organism lineage, gene loci, and protein products for 
genes covered by the reads. This XML output provides 
useful information for evaluating the quality of the 
results and facilitating downstream interpretation and 
analysis. For example, the specific reads assigned to a 
genome can be an important quality check for a meta- 
genomic analysis to check if the reads are low 



complexity or contain multiple PCR duplicates. The 
contigs show the breadth of genomic coverage, can 
identify sequence variation firom the reference, and facilitate 
scaffold-based genome assembly. The gene annotations 
identify the specific genes covered by the reads, can be use 
to annotation SNPs in specific genes, and (in RNA-seq 
studies) can identify which pathogenic genes are actively 
expressed. Examples of PathoReport XML files are given in 
Additional file 1. 
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J 



PathoLib 1 Target Alignment 
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Target Lib A 
(8.6 Gb) ■ 



Target Lib B 
(4.3 Gb) 



Split Library ' 
■ > 4.3 Gb 
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" (4.3 Gb) 
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" (4.3 Gb) 



Step 2: Read alignment 



Indices 
■ LibAI 



Alignment Output 
Bowtie2 >- sam 



_ Build Bowtie2 
Indices 




Filter Alignment 
(sam) 

Figure 3 PathoMap module workflow. The PathoMap module aligns reads to the target library and removes any sequences that align to the 
filter library. PathoMap will: (1) index the reference library; (2) align the reads to the target library; and (3) filter any of the target-matching reads 
that also match the filter library. The current version of PathoMap includes a Bovvtie 2 wrapper and allows users can conduct any part of the 
alignments outside the PathoMap framework. 



Optional plug-in modules 

The PathoScope 2.0 framework was constructed to easily 
add modules to the pipeline. Currently, there are two avail- 
able add-in modules: PathoQC and PathoDB. PathoQC is a 
read quality control module that performs several steps 



including trimming adapters, trimming low quality bases, 
and filtering low complexity sequences. PathoDB is a 
module consisting of taxonomy, gene, and protein 
product annotation for all sequences in the NCBI nu- 
cleotide database (Figure 6). In PathoDB, we have pre- 



PattioMap 



Alignment File (sam) 

// 1 9M3i3 I org I S_fj((£?iTj(j(.s_l'CLil27 1 gi 1 4186^372 



fi' 1 904543 I org | S_f;iidcniji.^_l'CLJI27 1 gi 1 41S6S57(I 



<i I 904343 I org I S_^'pii!enu^s_VCU127 \ gi \4136S371 



fj 1 1762^0 1 org I S_cp^d<:ni:is_ATCC | gi [ 274669JS 



^ Read Reassignment Algorilhm 

ti I 904343 I org | S_ei>kien}ih_VCU127 

■ czi 




) Reference Genome 



Step 2: Ambiguously aligned reads ar 
remapped to most probable 
genome of origin 



I Read with only one alignment 



» 



, Updated Alignment File (sam) 

li I 904343 I org \ S_epidcrmis_VCU127\gi 1 4136S372 



i I 904343 I org | S_ei>iderwi!._VCU127[gi 1 4136S370 



i 1 904343 I org \ S_cf>idcr)iiis_VCU127lgi \ 41S6S371 



■ I J 762S0 1 org | S_ci>idcrwis_ATCC \ gi \ 2746691S 



d with ambiguous alignment 



PathoReport 



Figure 4 PatholD module workflow. The PatholD utilizes a penalized statistical mixture model to reassign ambiguous reads to the most likely 
source genome. The PatholD module also includes performance improvements to the originally published reassignment approach (PathoScope 
1.0), by allowing users to calibrate prior information, better utilize alignment scores, and by including reference length and read alignment quality 
into the reassignment. 
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Reassigned Read 
Alignments 
(sam) 



PathoReport 
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• Read alignments 

• Coverage 
•Assembly contigs 

• Summary Report 



I 'Organism lineage | 
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Figure 5 PathoReport module workflow. The PathoReport 
module outputs two report files including: (1) a tab-delimited (tsv) 
report that contains a ranked list of genomes (with proportions) 
identified by the pipeline; and (2) an XML file containing detailed 
results including the reads assigned to each genome, contigs 
constructed from overlapping reads, and so on. 



compiled an annotation database containing informa- 
tion including organism name, lineage, NCBI Genlnfo 
identifier, gene and exon locations, and protein prod- 
ucts for all sequences in the NCBI nucleotide database. 
This information was assembled from 560 GB of anno- 
tation from the NCBI GenBank, RefSeq, and Third- 





r ^ 


GenBank 


RefSeq 



Third Party 
Annotation 



^ PathoDB. dmp 

For each in NCBI's nucleotide database: 

• Organism name 

• Organism lineage 

• NCBI gene info identifier 

• Gene and exon locations 

• Protein products 



Load data into mysql database 




PathoReportj 

Figure 6 PathoDB module workflow. PathoDB is an optional 
module of pre-compiled annotation for all sequences in the NCBI 
nucleotide database. The PathoDB module automatically interacts 
with PathoReport to provide additional annotation In the detailed 
(XML) report such as organism lineage, gene loci, and protein 
products for any genes covered by the reads. 



Party Annotation resources (ftp://ftp.ncbi.nih.gov/). 
The PathoDB module automatically interacts with 
PathoReport to provide additional annotation in the 
detailed XML report. The compiled database is avail- 
able for download from the PathoScope distribution 
webpage (http://sourceforge.net/projects/pathoscope/). In 
order to utilize PathoDB and automatically extract the data- 
base information in PathoReport, the user needs a MySQL 
client account or needs to set up a MySQL server. Because 
this requires several independent steps for installation, we 
have made this module optional. 

Future plug-in modules will be added, including add- 
itional aligner wrappers, modules for data visualization, 
and modules for 'post-diagnostic' variant calling, annota- 
tion, and genome assembly. 

Evaluation of PathoScope 2.0 on simulated and real-data 
examples 

We rigorously evaluated PathoScope 2.0 using two data- 
sets. The first dataset consisted of in silico genomic 
sequencing reads simulated from 25 strains of bacteria 
that are commonly found in humans which includes five 
strains of Escherichia coli, five strains of Staphylococcus 
aureus, five strains of Streptococcus pneumoniae, and 10 
other commonly occurring human bacterial strains. The 
second dataset contained clinical sequencing samples 
from the 2011 European outbreak of Shiga-toxigenic E. 
coli O104:H4. We demonstrate the improvements in 
PathoID 2.0 compared to PathoID version 1.0 in our 
simulation study of 25 bacterial strains. Using the clin- 
ical samples, PathoScope 2.0 is compared with other 
similar pipelines including PathoScope 1.0 for both 
accuracy of diagnosis and speed. 

Simulation study to evaluate improvements in PathoID 2.0 

Our simulated data consisted of five sets of 100,000 sim- 
ulated Illumina reads derived from each of the 25 strains 
of bacteria (see Methods). First, we processed the reads 
for each of the 25 bacterial strains individually using 
both PathoID version 1.0 and 2.0 with default parameter 
values and also PathoID 2.0 using a highly informative 
prior (see Methods). Although PathoID 1.0 was able to 
estimate the correct proportions of reads at the species 
level (100% to the particular species) for each of these 
25 samples, it was not able to estimate the correct pro- 
portions of reads at the strain level (100% to the particu- 
lar strain) for six samples (Additional file 2). In contrast, 
PathoID 2.0 using default parameters estimated the cor- 
rect proportions of reads at the strain level (100% to the 
particular strain) for all the 25 samples. PathoID 2.0 with 
an informative prior estimated the correct proportions 
of reads at the strain level (100% to the particular strain) 
for 24 of the 25 samples, but was unable to estimate the 
correct proportion for one sample with S. aureus N315. 
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Table 1 Simulation study results 





PatholD 1.0 




PatholD 2.0 




Organism 


Default 


Default 


ThetaPrior (Low) 


ThetaPrior (High) 


Bacteroides fragiiis 638R 


4.00% 


3.99% 


3.99% 


3.99%) 


Bifidobacterium bifidum BGN4 


4.00% 


3.99% 


3.99% 


3.99% 


Ciostridium perfringens ATCC 1 3 1 24 


3.99% 


3.99% 


3.99% 


3.99% 


Enterococcus faecaiis \/583 


3.98% 


3.99% 


3.99% 


4.00% 


Esclierictiia coii 042 


17.15% 


4.01% 


410% 


4.02% 


Escliericliia coii 55989 


0.57% 


0.50% 


1.51% 


3.83% 


Esctierictiia coii SEl 1 


0.28% 


10.10% 


7.07% 


3.74% 


Eschericiiia coii SEl 5 


071% 


3.43% 


3.73% 


3.82% 


Escherichia coii UMNK88 


1 .29% 


1 .95% 


3.59% 


4.16% 


Haemophiius infiuenzae 10810 


3.98% 


3.99% 


3.99% 


3.99% 


Neisseria meningitidis MC58 


3.25% 


3.99% 


3.99% 


3.92% 


Pseudomonas aeruginosa DK2 


4.00% 


3.99% 


3.99% 


3.99% 


Staphyiococcus epidermidis ATCC 1 2228 


3.97% 


3.98% 


3.98%) 


3.98% 


Streptococcus mitis B6 


2.94% 


3.82% 


3.87%) 


3.94% 


Streptococcus mutans UAl 59 


3.58% 


3.99% 


3.99%) 


3.99% 


Stapyiococcus aureus HO 5096 041 2 


0.31% 


1 .76% 


2.48%) 


3.80% 


Stapyiococcus aureus JKD6008 


0.05% 


1 5.82% 


13.40% 


3.79% 


Stapyiococcus aureus MRSA252 


0.69% 


1 46% 


2.14%) 


3.85% 


Stapyiococcus aureus N315 


0.00% 


0.68% 


1 .46%) 


1.01% 


Stapyiococcus aureus Newman 


0.15% 


0.33% 


0.55% 


3.48% 


Streptococcus pneumoniae 570-5B 


0.92% 


3.28% 


12.87% 


4.43% 


Streptococcus pneumoniae ATCC 700559 


0.29% 


0.99% 


2.10%) 


4.23% 


Streptococcus pneumoniae G54 


0.16% 


0.44% 


1.17%) 


3.35% 


Streptococcus pneumoniae Hungary 19A-5 


1 942% 


14.75% 


2.58%) 


4.33% 


Streptococcus pneumoniae Jaiwan] 9 F-14 


0.00% 


0.66% 


1 .34%) 


2.83%) 



Comparison of PatholD 1.0 to 2.0 using simulated data (see the section titled 'Simulation study to evaluate improvements in PatholD 2.0' for details). 



The results are consistent for all five sets of simulated 
samples. 

We also combined all the reads from the 25 strains, to 
create a dataset with all bacteria in equal proportions 
(expected proportion for each strain: 4%). Again, we 
applied both PatholD version 1.0 and 2.0 with default 
parameter values and added multiple informative priors 
(low, moderate, high; see Methods). We saw marked 
improvement in PatholD version 2.0 over version 1.0, 
including increased accuracy with informative priors 
(Table 1). For the 10 bacterial species that only included 
one strain per species, all methods performed well and 
estimated the correct read proportions. For the three 
species that contained multiple strains (£. coii, S. aureus, 
and S. pneumoniae), PatholD 1.0 and 2.0 at default pa- 
rameters were able to identify all of the strains present, 
but struggled to estimate the correct read proportions. This 
failure demonstrates the tendency of the PatholD algorithm 
(at default parameters) to identify a single strain for each 
species, and if multiple strains or substrains are present it 



may reassign too many of the reads to a single strain. This 
tendency is an advantage in cases where there is only one 
strain of each species in the sample, but it leads to inaccur- 
acies in the proportion estimates when multiple strains or 
substrains of the same species are present in the sample. 
The result of PatholD 2.0 with a highly informative prior 
matched closely with the expected proportions for 24 of 
the 25 strains, including 14 of the 15 cases with multiple 
strains of the same species. This demonstrates the value of 
using a highly informative prior when there are multiple 
strains of the same species in the sample, but we note that 
this comes at reduced effectiveness when there is only a 
single strain of each species in the sample. Finally, we re- 
peated our simulation study using multiple mixtures of the 
25 strains at random proportions, but the results were con- 
sistent with the mixtures at equal proportions, that is, the 
strains for which the proportions were not accurate with 
the mixtures at equal proportions matched with that of the 
mixtures at random proportions. Hence, we do not discuss 
them in detail here (see Additional file 2). 
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The one strain that was not estimated well using a 
highly informative prior was S. aureus N315, which had 
a final average read assignment percentage of 1.01% 
(Additional file 2). PathoID 2.0 with a highly informative 
prior incorrectly assigned the following average read per- 
centages to neighboring strains: Mu50: 0.8%, Mu3: 0.7%, 
TW20, JHl, and JH9: 0.4%, 04-02981 and T0131: 0.3%, 
ST228: 0.2%, USA300 TCH1516: 0.1% and USA300 
FPR3757: 0.1% (See Additional file 2). After further 
evaluation, we observed that PathoID failed with this 
strain due to the 'sequencing errors' in our simulated 
reads that caused some of the N315 reads to align more 
closely to the related strains. This phenomenon limited 
the ability of PathoID to correctly estimate the correct 
read proportions for this strain. 

Evaluation and comparison on clinical sequencing samples 

We selected samples from fecal specimens obtained 
from patients with diarrhea during the 2011 European 
outbreak of Shiga-toxigenic Escherichia coli (STEC) 
O104:H4 (NCBI accession: ERP001956). To demonstrate 
the flexibility of PathoLib with the E. coli data, we 
constructed a targeted library containing only E. coli 
subspecies (taxID: 562). We note that many sequences 
contained in this library are redundant, and include sep- 
arate sequence entries for complete genomes, individual 
chromosomes or plasmids, and distinct transcripts. This 
sequence redundancy allows PathoLib to identify the 
complete genome or chromosome for the strain if 
present, and also allows for the identification of the plas- 
mid or genes present in the case of a horizontal transfer. 

The updates to the PathoID algorithm increased the 
number of reads that were correctiy assigned to the source 
genome, reduced the number of false positive genomes 
identified, and allowed for improved identification of mul- 
tiple substrains in the same sample (Table 2). In addition, 
PathoScope 2.0 outperformed competing methods, such as 
RINS [12] and ReadScan [15], in computational efficiency, 
detection accuracy in terms of number of reads assigned to 
and overall ranking of STEC genomes, as well as in the in- 
terpretability of the results (Table 2). More detailed results 
from PathoScope (versions 1.0 and 2.0) and comparisons 
with two other near-complete pipeline methods, RINS and 
ReadScan, are detailed below. 

When comparing PathoID versions 1.0 and 2.0, we ob- 
served that version 2.0 assigned more reads to the STEC 
genome in 31 of the 40 STEC positive samples. In these 
40 samples, we observed mild to moderate increases in 
proportion of reads correctly reassigned, as well as 
reductions in number of incorrect genomes that were 
assigned reads. For example, for one of the samples 
(sample 3852), version 1.0 of the algorithm reassigned 
92.4% of the reads to the correct O104:H4 substrain. 
However, four other incorrect E. coli strains each 



received between 1% and 2% of the reads. In contrast, 
with the updated scoring approach, 99.99% of the reads 
were correctly reassigned to the proper substrain. In 
addition, the original scoring approach assigned read 
proportions above 0.2% (range: 0.2% to 2.2%) to seven 
different incorrect E. coli strains; whereas with the new 
scoring system, the maximum incorrect read propor- 
tion was approximately 20 to 200 times smaller at 
0.01%. The other 30 samples showed similar improve- 
ments (that is, shorter genome lists and more reads 
assigned to the STEC genome) using the updated 
scoring system. 

In addition, we considered one sample in more detail 
(sample 2535) because this sample contained a mixture 
of distinct E. coli strains. Interestingly, PathoID found 
O104:H4 at 39.5% and a few other strains in high pro- 
portions as well including DHl: 12.0%, 55989: 11.9%, 
BL21(DE3): 11.8% and 0127:H6 str. E2348/69: 8.4%. 
After processing this with a full bacterial reference 
library generated by PathoLIB, the sample still remained 
dominantly a mixture of E. coli strains, but PathoID did 
identify Ruminococcus sp. SRl/5 (6.9% of the reads) in 
the sample as well. 

We compared the performance of PathoScope (ver- 
sions 1.0 and 2.0), RINS, and ReadScan with these same 
STEC samples. The goal of the original study was to 
characterize STEC strains in a retrospective manner. In 
all 40 STEC positive samples, PathoScope (versions 1.0 
and 2.0) identified the correct STEC genome with vary- 
ing read assignment proportions (see Table 2). In 27 of 
the 40 samples, the STEC genome was the highest 
ranked genome, and in 35 samples the STEC genome 
was among the top three ranked genomes. In compari- 
son, in all 40 samples ReadScan was able to identify the 
Shiga-toxin producing plasmid, ranking the correct plas- 
mid first in 28 of the samples and in the top three hits 
for 33 samples. However, ReadScan could not discrimin- 
ate between specific E. coli strains, indicating that Read- 
Scan was very effective at identifying the pathogenic 
elements, but it could not identify the specific strains in 
which the plasmid resided because it does not connect 
plasmids, genes, or chromosomes from the same gen- 
ome in the reference library. This reduces ReadScan's 
ability to use unique plasmids and genes to identify the 
correct species/strains on the sample, particularly in 
cases where the pathogenic plasmid is transferred to a 
different bacterial strain or species. RINS was unable to 
accurately identify the correct STEC genome or plasmid 
in any of the samples. RINS employs a method of assem- 
bly of contigs from the reads and uses a local version of 
BLAST to classify contigs, which does not seem to work 
well when the reads are too small for assembly and 
when there are some read errors. RINS generated a large 
number of very small contigs that did not uniquely align 
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^Used different parameters for bowtie2 alignment: '-very-sensitive-local -score-min L,280,0.0 -k 10'. 
"^There are multiple sequencing runs for this sample, so the results were averaged across replicates. 

DNF: the algorithm did not finish in less than 24 h on a 16 cpu compute node; Nl: the STEC O104:H4 genome was not identified by the method. 

These are the results from the O104:H4 study - Positive Samples. See the section titled 'Evaluation and comparison on clinical sequencing samples' for details. 
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to any specific genome with these samples. When the 
reads are too short and the coverage is not deep enough 
for assembly, as in this case, discriminatory information 
can be lost by an assembly approach because some of 
the reads that carry discriminatory information reads 
were not included in the contigs. 

We also analyzed nine STEC negative samples from 
this dataset (Table 3). RINS did not score a STEC gen- 
ome or plasmid very highly in any of the samples, but 
the results were very consistent (that is, genome/plasmid 
rankings) with the STEC positive samples. ReadScan 
ranked the STEC plasmid as one of the two highest 
scoring hits for four of the eight samples. Similarly, 
PathoID 1.0 also estimated the STEC genome propor- 
tions to be more than 1.0% in four of the samples. In 
contrast, PathoID 2.0 only estimated one of the samples 
to have a 'high' STEC percentage (1.4%). We note that 
these false positives were due to the fact that these reads 
were only aligned to the E. coli reference library, and 
when we used the more general bacterial reference 
library, we found that PathoID 2.0 estimated the STEC 
percentage as close to zero for all the samples. 

In addition to being more accurate than the other 
methods, PathoScope also outperformed RINS and 
ReadScan in terms of the computation time needed to 
complete the analysis. On a cluster node with 16 CPUs 
and 256 GB of RAM, PathoScope 2.0 required an aver- 
age of 17 min of compute time, whereas ReadScan re- 
quired an average of 50 min and RINS required an 
average of more than 8 h for the 30 completed samples 
(10 of the samples did not finish in less than 24 h). 
PathoID version 1.0 was on average 3 min faster than 
PathoID version 2.0, but this computational time 
increase is almost entirely due to added features in ver- 
sion 2.0, such as better genome annotation, detailed re- 
sult reporting, and interaction with other PathoScope 



modules. We also repeated the experiment with a lar- 
ger target database containing all bacterial sequences. 
Only PathoScope 2.0 was able to finish this run 
successfully in a reasonable computational timeframe 
(<24 h per sample). 

Conclusions 

PathoScope 2.0 provides a complete modular bioinfor- 
matics workflow to analyze metagenomic sequence data 
from clinical or environmental samples. The pipeline 
helps researchers to efficiently generate custom refer- 
ence libraries, align reads to a target library, filter host 
reads, overcome read alignment ambiguity, characterize 
target diversity, and annotate results. Our simulated and 
real-data examples show that PathoScope 2.0 is a highly 
sensitive and efficient approach for metagenomic ana- 
lysis, without the need for computationally intensive 
database preprocessing and time-consuming de novo 
assembly. PathoScope 2.0 is a fast and modularized pipe- 
line for which we provide a comprehensive command 
line interaction so that more advanced users can select- 
ively run parts of the modules, but is user-friendly 
enough to be used by researchers with weaker computa- 
tional backgrounds. 

The libraries generated by PathoLib contain all avail- 
able reference genome sequences (for example, NCBI 
nucleotide) that meet the user-defined genome or taxID 
selection set. These sequences are often redundant, and 
may include separate entries for the complete genome 
sequences, individual chromosomes or plasmids, and 
distinct transcript sequences. Thus, for any given bacter- 
ial genome, chromosome, or plasmid, the reference 
library will likely contain all three elements in separate 
forms. PathoScope will then use the reads to identify the 
most likely source for the reads. For example, if the 
reads only align to the plasmid, then PathoScope will 



Table 3 Comparison of PathoScope 2.0 against alternatives with ELISA negative fecal samples 
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Nl: the STEC O104:H4 genome was not identified by the method. 

These are the results from the OW4:H4 study - Negative Samples. See the section titled 'Evaluation and comparison on clinical sequencing samples' for details. 
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select the plasmid as the source. If the reads align to 
both the plasmid and the surrounding sequence and/or 
neighboring chromosomes, the entire genome will 
likely be selected. In the E. coli example above, this 
allowed PathoScope to identify the entire STEC gen- 
ome of interest, whereas ReadScan was only able to 
identify the STEC plasmid. 

One possible limitation to the reference-based ap- 
proach used by PathoScope is that it relies on the gen- 
ome for each strain to be present in the library in order 
to achieve a precise identification. We note that PGR 
and microarray based approaches fail when the target is 
unknown as well, as we cannot make PGR primers when 
we do not know the sequence. In cases where the 
species is truly novel with no closely related sequenced 
genomes, PathoScope will not be able to identify the 
species/strain of origin for the reads. However, we have 
previously shown that when the specific strain is not in 
the reference, PathoID can successfully identify the near- 
est species/strain in the library [16]. For example, we 
showed that when the E. coli O104:H4 genome was not 
in the library, PathoID successfully identified the 55989 
strain, which has been previously established as the 
nearest sequenced strain to O104:H4 [16]. We observed 
a similar effect when we removed the correct F. tularen- 
sis strain from the library [16]. Therefore, based on these 
examples, we have demonstrated that PathoScope is a 
robust and useful tool for the majority of the metage- 
nomic studies. Furthermore, the modular framework of 
PathoScope 2.0 allows for additional assembly-based 
modules to be added to the framework that would be 
able to identify completely novel genomes. 

Methods 

Improvements to the PathoID reassignment model 

The PathoID module essentially comprises our previ- 
ously published PathoScope version 1.0 software [16]. 
PathoID inputs aligned sequencing reads in SAM format 
[18], and returns an updated alignment file of read reassign- 
ments (also SAM format) along with summary report con- 
taining read proportions assigned to each genome in the 
reference library. The modular PathoScope 2.0 framework 
introduces improvements in the original PathoID algo- 
rithm, as well as several novel functionalities that extend 
PathoID into a complete workflow for sequence-based 
metagenomic profiling. These include: (1) the automatic ex- 
traction of custom reference genome libraries (PathoLib); 
(2) the construction of reference indices (splitting large li- 
braries into separate indices, if necessary), the alignment of 
reads to the target libraries, and the filtering reads that align 
to the host or other filter libraries (PathoMap); (3) the pre- 
processing of reads with complete and parallel quality con- 
trol (PathoQC); (4) the annotation of all sequences in the 
reference library (PathoDB); and (5) detailed reports on 



organisms, reads, genes, and gene products identified in the 
study (PathoReport). 

PathoID utilizes a missing data mixture modeling ap- 
proach, where the template genome of origin is the 'missing 
data'. It integrates information from the read alignment 
with information obtained by borrowing strength across all 
reads from the sample to reassign ambiguous reads to 
the most likely source genome in the library. The 
PathoID likelihood contains a parameter that penalizes 
reads that align to multiple genomes; thus, increasing 
the impact of uniquely mapping reads on the reassign- 
ment result. Conversely, the previously published 
PathoID algorithm does not penalize reads relative to 
their 'best' read alignment, meaning that reads that 
align perfectly have the same influence on the result as 
reads whose best alignment contains base mismatches. 
Furthermore, in the PathoID version 1.0, reference 
genome length is also not expressly modeled in the 
likelihood, yielding an advantage for completed ge- 
nomes over smaller incomplete sequence fragments. In 
cases where the closest reference genome is incom- 
pletely assembled, PathoID might tend to identify a 
more genetically distant completed genome. Finally, 
although PathoID was initially developed under a 
Bayesian framework, the PathoID version 1.0 software 
does not allow users to easily modify the preset priors. 
This becomes extremely important in cases where 
there is more than one substrain of the same species 
present in the same sample. To accommodate these 
concerns, we introduced the following changes into 
version 2.0 of the PathoID reassignment model: 

Read alignment score 

To increase the influence of perfect match reads in 
PathoID, we implemented a weighted likelihood based ap- 
proach, which consists of weighting the reads in the log- 
likelihood based on their relative alignment likelihood 
(exponentiated alignment 'bit score'). A more general form 
of this weighted likelihood formulation has been rigorously 
evaluated and shown to share the same asymptotic features 
of the genuine likelihood function [19]. In the finite sample 
metagenomic context, the reads with a higher alignment 
scores will have more influence on the results than reads 
that align with more mismatches. 

Reference genome length 

We weight likelihood contributions for each alignment by 
the inverse of the length of a target genome. Similar 
approaches for adjusting for length in sequence alignments 
are well established, for example, BLAST 'E -value' [20]. 

Read alignment scores 

For most SAM alignment files (including Bowtie 2 align- 
ments), our original read assignment algorithm used the 
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SAM-MAPQ score to provide relative alignment prob- 
abilities. We now use the read alignment bit score (AS) 
from the Bowtie2 output. The AS is also standardized by 
adding the read length and normalizing the score to a 
fixed range so that the likelihood calculations of the EM 
algorithm do not exceed upper or lower computational 
precision limits. 

Flexible user-defined prior values 

The PathoID reassignment module was developed under 
a Bayesian framework that allows researchers to insert 
prior information into the genome identification ap- 
proach. This provides an option for the user to 'bias' the 
PathoID reassignment in cases where the researcher has 
a priori knowledge of the likely content of the metage- 
nomic sample. In addition, by assigning a prior value on 
the read ambiguity penalty, the user can modify the 
severity of the penalty placed on ambiguous reads. This 
becomes extremely important in cases where there are 
multiple substrains of the same species in the sample. 
With non-informative priors in this penalty parameter 
(default in version 1.0), PathoID has the tendency to 
search for a 'parsimonious' solution, often at the cost of 
assigning all non-unique reads to only one of the highly 
similar substrains present in the sample. By adding an 
informative prior to the penalty parameter, the PathoID 
reassignment will be less precise in its read assignments, 
but will more closely estimate the correct proportions 
assigned to the multiple substrains - usually at the cost 
of decreased accuracy of the unique strains in the sam- 
ple. In version 2.0 of the PathoID algorithm, we allow 
the user to modify the prior values placed on the gen- 
ome proportion and read penalty parameters. To deter- 
mine whether an informative prior is needed, we suggest 
that the user considers an alignment to a core genome 
region for species that might have multiple strains 
present in the sample. If there are multiple closely re- 
lated strains present in the sample then it is best to use 
highly informative prior value. On the contrary, if it ap- 
pears that there is one particular strain to be identified, 
then it is best to use default or low informative prior 
value. 

Simulation study details 

We generated a simulation study dataset to assess the 
performance of improvements made to PathoID in ver- 
sion 2.0 of the software. In particular, we wanted to 
evaluate the impact of informative prior information on the 
ability to accurately estimate genome proportions when 
multiple strains of the same species are present in the 
sample. To do this, we simulated sequencing reads firom 25 
strains of bacteria which includes five Escherichia coli 
strains (042, 55989, SEll, SE15 UMNK88), five Staphylo- 
coccus aureus strains (JKD6008, Newman, MRSA252, HO 



5096 0412, N315), five Streptococcus pneumoniae strains 
(670, ATCC_700669, G54, Hungaryl9A, Taiwanl9F) and 
10 other common bacterial strains {Bacteroides fragilis 
638R, Bifidobacterium bifidum BGN4, Clostridium 
perfringens ATCC 13124, Enterococcus faecalis V583, 
Haemophilus influenzae 10810, Neisseria meningitidis 
MC58, Pseudomonas aeruginosa DK2, Staphylococcus epi- 
dermidis ATCC 12228, Streptococcus mitis B6, Strepto- 
coccus mutans UA159). The phylogenetic relationships 
between these strains and other strains available in the 
NCBI database are given in Additional file 3. We used 
the Mason read simulator [21] to generate five sets of 
100,000 reads for each strain simulating 100 bp single-end 
sequencing reads using an 'Illumina-like' sequencing error 
model; Mason parameters: 'iUumina -s ## -N 100000 -sq -n 
100 -i -hs 0.0 -hi 0 -hnN -nN' (-s (Seeds) = 1101, 1102, 
1103, 1104, 1105). We used PathoLib (-t 2 -subTax) to gen- 
erate a reference library containing all bacteria. We used 
PathoMap (default parameters) to index and align the reads 
to the bacterial library. PathoMap automatically splits the 
bacterial library into smaller parts (<4.3 GB in size) that the 
Bowtie2 aligner can process and combines the alignment 
files together for the final results. We then applied PathoID 
(versions 1.0 and 2.0) to the simulated datasets. PathoID 
version 2.0 was applied with default parameters and with 
three informative priors (low, moderate, high). The low in- 
formative prior corresponds to '-thetaPrior 1000! moderate 
informative prior corresponds to '-thetaPrior 50000' and 
high informative prior corresponds to '-thetaPrior 10**88'. 
The thetaPrior value represents the number of non-unique 
reads that are not subject to reassignment. 

European E. coli outbreak samples 

We obtained data from the 2011 European outbreak of 
Shiga-toxigenic Escherichia coli O104:H4 [22]. The data- 
set consisted of 150 bp paired-end sequencing reads 
from fecal samples obtained from patients impacted 
by outbreak (NCBI accession number: ERP001956). We 
used reads from 21 sequencing runs originating from 19 
samples for which a standard enzyme-linked immuno- 
sorbent assay (ELISA) identified a bacterial shiga toxin. 
The data were pre-filtered for human nucleic or mito- 
chondrial reads and only included reads that have a 
'best-hit' alignment to bacterial taxon genomes [22]. We 
applied PathoScope 2.0, PathoScope 1.0, ReadScan, and 
RINS to the datasets using the parameters given in 
Additional file 4. Because of the nature of this study and 
to demonstrate the flexibility of PathoLib, we con- 
structed a target library containing all E. coli subspecies 
(taxID: 562) in the NCBI RefSeq nucleotide database 
(download date: January, 2013). PathoLib extracted 
46,640 sequence entries from a total 54 different E. coli 
strains. We used these samples to compare the improved 
performance of PathoScope 2.0 over PathoScope 1.0, 
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and have also included comparisons with two other 
near-complete pipeline methods, RINS and ReadScan. 
We recorded the rank of the E. coli O104:H4 genome, 
the highest-ranking pathogenic E. coli plasmid (Read- 
Scan only), the proportion of reads assigned to O104:H4 
(PathoID only), and the total computational time required. 
We also limited computational time to 24 h on a 16 cpu 
node (24 x 16 = 384 cpu hours) for each individual sample. 

PathoScope 2.0 software tutorial 

We provide a complete tutorial (Additional file 5) that 
covers the basics of using PathoScope to analyze metage- 
nomic samples. The tutorial provides a step-by-step pro- 
cedure to transform unintelligible sequencing reads into a 
meaningful picture of the microbial content present in a 
sample. 

Additional files 
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